This first section details an example of the code necessary for cleaning a 5W dataset in R. 5W data is typically messy and in wide format. Additionally, data wrangling is time consuming and an inefficient use of a specialist’s time. The code presented below intends to demonstrate that this monthly (or sometimes more frequent) occurence can be greatly automated, allowing for more timely and accurate analysis as well as creating the space for more tailored reporting.
# specifying column types
act_col_types <- c("date", "date", "text", "text","text","text",
"text","text","text","text","text","text","text",
"numeric", "numeric", "text","text","text",
"text","text","text", "numeric", "text",
"date", "date", "text", "numeric", "numeric",
"logical", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric")
# you're gonna have to live with the error messages,
# since you can't slice before read_excel() argument
act1<- read_excel("database activities 5W.xlsx",
sheet = "data",
skip = 1,
col_types = act_col_types) %>%
clean_names() %>%
remove_empty() %>%
slice(-c(1)) %>% # removes the second row
slice(-c(12055)) # removes the grand total
A
glimpse()of the renamed dataset:
## Rows: 12,054
## Columns: 43
## $ fecha_reportado_cluster <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ mes_reportado <dttm> 2020-01-31, 2020-01-31, 2020-01-31,~
## $ codigo_de_proyecto <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ nombre_del_proyecto <chr> "Todos y Todas a la Escuela", "Todos~
## $ organizacion_lider <chr> "UNICEF - Fondo de las Naciones Unid~
## $ organizacion_implementadora <chr> "SOCIO 1", "SOCIO 1", "SOCIO 1", "SO~
## $ estado <chr> "Zulia", "Zulia", "Zulia", "Distrito~
## $ pcode1 <chr> "VE23", "VE23", "VE23", "VE01", "VE0~
## $ municipio <chr> "Maracaibo", "Maracaibo", "Maracaibo~
## $ pcode2 <chr> "VE2313", "VE2313", "VE2313", "VE010~
## $ parroquia <chr> "Luis Hurtado Higuera", "Luis Hurtad~
## $ pcode3 <chr> "VE231312", "VE231312", "VE231312", ~
## $ ubicacion <chr> "E. T. Dr. Luis Razetti Fe Y Alegrí~
## $ latitud <dbl> 10.596934, 10.596934, 10.596934, 10.~
## $ longitud <dbl> -71.66896, -71.66896, -71.66896, -66~
## $ actividad_full <chr> "CLEDU/CA2.09: Formación docente y o~
## $ actividad_codigo <chr> "CA2.09", "CA2.09", "CA2.09", "CA2.0~
## $ actividad_desc <chr> "Formación docente y otro personal e~
## $ actividad_codigo_desc <chr> "CA2.09: Formación docente y otro pe~
## $ act_covid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ unidad <chr> "#Personas", "#Personas", "#Personas~
## $ cantidad_del_mes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ recurrente_beneficiarios <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ fecha_de_inicio <dttm> 2019-10-22, 2019-10-15, 2020-01-13,~
## $ fecha_prevista_finalizacion <dttm> 2020-01-20, 2020-01-13, 2020-01-13,~
## $ estatus_actividad <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ total_beneficiarios <dbl> 16, 13, 41, 2, 1, 25, 19, 19, 17, 17~
## $ check_beneficiaries <dbl> 16, 13, 41, 2, 1, 25, 19, 19, 17, 17~
## $ t_f_ben <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ~
## $ percent_poblacion_indigena <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ percent_personas_con_discapacidad <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ m_0_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_3_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_7_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_12_17 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ m_18_19 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_0_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_3_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_7_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_12_17 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_18_19 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ f_mayores_de_19 <dbl> 11, 4, 41, 2, 1, 20, 10, 10, 5, 8, 3~
## $ m_mayores_de_19 <dbl> 5, 9, 0, 0, 0, 5, 9, 9, 12, 9, 13, 1~
Whilst it is true when some people say that if you remove tildes and other diacritical marks, it it no longer becomes Spanish. However, it must be said that implementing partners don’t care and have included a wide variety of spellings. I would say that remove all accent marks is necessary to continue working with the dataset.
# function to remove accents
rm_accent <- function(colns){
colns <- stri_trans_general(colns, "Latin-ASCII")
}
# removing accents
# and str_to_upper() since R is case-sensitive
act1 <- act1 %>%
mutate(estado = rm_accent(str_to_upper(estado)),
municipio = rm_accent(str_to_upper(municipio)),
parroquia = rm_accent(str_to_upper(parroquia)),
ubicacion = rm_accent(str_to_upper(ubicacion)),
actividad_desc = rm_accent(str_to_upper(actividad_desc)))
These should just be dropdown menus; partners should also be informed that blanks will be treated as FALSE. Additionally, a new date column just with the month is mutated, as the
mes_reportadocolumn has some months with multiple reporting dates.
# recoding
act1 <- act1 %>%
mutate(recurrente_beneficiarios =
recode(recurrente_beneficiarios,
"no" = FALSE, "No" = FALSE, "Si" = TRUE, "si" = TRUE, "sí" = TRUE, "Sí" = TRUE),
act_covid = recode(act_covid,
"no" = FALSE, "No" = FALSE, "Si" = TRUE, "si" = TRUE, "sí" = TRUE, "Sí" = TRUE)) %>%
replace_na(list(recurrente_beneficiarios = FALSE, act_covid = FALSE))
# recoding estatus_actividad
act1$estatus_actividad <- act1$estatus_actividad %>%
str_replace_all(c("En ejecucion" = "ejecucion",
"en ejecución" = "ejecucion",
"en Ejecución" = "ejecucion",
"En ejecución" = "ejecucion",
"En Ejecución" = "ejecucion",
"finalizada" = "finalizada",
"Finalizada" = "finalizada"))
# create a new variable of last day of month only
act1$mes_solo <- ceiling_date(act1$mes_reportado, "month") - days(1)
locations.csvAnd check if you need to update it.
locations <- read_csv("locations.csv") %>%
mutate(estado = rm_accent(str_to_upper(estado)), # just to make sure
municipio = rm_accent(str_to_upper(municipio)),
parroquia = rm_accent(str_to_upper(parroquia)),
ubicacion = rm_accent(str_to_upper(ubicacion)))
Maybe we should add new pcodes for “todo municipio” and “todo estado”; will they be abused?
# see if you need to update locations
# new locations not in locations.csv will be caught by the anti-join()
locations_add <- act1 %>%
select(estado, pcode1, municipio, pcode2, parroquia, pcode3, ubicacion,
latitud, longitud) %>%
distinct() %>%
anti_join(locations, by = "ubicacion")
# splitting into two datasets, one clean one dirty.
adm_dirty <- act1 %>%
filter(is.na(estado) | is.na(pcode1) |
is.na(municipio) | is.na(pcode2) |
is.na(parroquia) | is.na(pcode3))
adm_clean <- act1 %>%
filter(!is.na(estado) & !is.na(pcode1) &
!is.na(municipio) & !is.na(pcode2) &
!is.na(parroquia) & !is.na(pcode3))
coalesce()The locations reference dataset is used to clean
adm_dirtyand rewrite it. Thedistinct()call at the end is to ensure that no duplicates are included as theleft_join()argument will produce duplicates (still not sure how to control this). After which,adm_cleanandadm_dirtyare combined intoact2.
# filling in missing values
adm_dirty <- adm_dirty %>%
left_join(locations, by = "ubicacion") %>%
mutate(estado = coalesce(estado.x, estado.y),
pcode1 = coalesce(pcode1.x, pcode1.y),
municipio = coalesce(municipio.x, municipio.y),
pcode2 = coalesce(pcode2.x, pcode2.y),
parroquia = coalesce(parroquia.x, parroquia.y),
pcode3 = coalesce(pcode3.x, pcode3.y)) %>%
select(-estado.x, -estado.y, -pcode1.x, -pcode1.y,
-municipio.x, -municipio.y, -pcode2.x, -pcode2.y,
-parroquia.x, -parroquia.y, -pcode3.x, -pcode3.y) %>%
distinct() # removing duplicate rows from the join
# bind_rows() does not care about column sequence
# but put adm_clean first so that the original order is preserved
act2 <- bind_rows(adm_clean, adm_dirty)
printing total beneficiaries for
act1andact2, followed bynrow()for both.
## [1] 28965878
## [1] 28965878
## [1] 12054
## [1] 12054
This is to check that
total_beneficiariosis equal to the sum of all disaggregated columns. First, we mutate a new column by summing all the disaggregated beneficiaries and calling ittotal_ben_check. Then, we mutate a new column in thr dataset calledno_esp_benor beneficiarios no especificados so that we maintain the un-disaggregated beneficiaries without polluting the rest of the dataset.
# mutating bencheck and unspecified beneficiaries columns
act2 <- act2 %>%
mutate(total_ben_check = select(., m_0_3:m_mayores_de_19) %>% rowSums(na.rm = TRUE),
no_esp_ben = round(total_beneficiarios) - round(total_ben_check))
Next, we print rows where the sum of disaggregated beneficiaries (
total_ben_check) do not match the totals reported in thetotal_beneficiarioscolumn so that we can return to the relevant parties for follow up.
| organizacion_implementadora | actividad_desc | ubicacion | total_ben_check | total_beneficiarios | no_esp_ben |
|---|---|---|---|---|---|
| SOCIO 2 | PROMOCION MENSAJES CLAVES PARA LA COMUNIDAD ESCOLAR | U.E.C. DIVINA PASTORA (CATEDRAL) | 0 | 244 | 244 |
| SOCIO 4 | INICIATIVAS PARA REINSERCION EDUCATIVA DE NNA FUERA DE LA ESCUELA | U.E. SAN JOSE DE COTIZA | 18 | 26 | 8 |
| SOCIO 6 | FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | MINISTERIO DEL PODER POPULAR PARA LA EDUCACION | 0 | 398 | 398 |
| SOCIO 6 | FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | CLUSTER DE EDUCACION | 0 | 146 | 146 |
# timestamp for all the files to be written below
now <- Sys.time()
For submissions to OCHA, this one aligns with their format
# turn on eval if you want the file
filename_wide <- paste0(format(now, "%y%m%d_%H%M_"), "5W_wide.csv")
write_csv(act2, filename_wide)
This section pivots the dataset longer into tidy data, with one observation per instance of age and sex; this is necessary for easy plotting and analysis. Additionally, rows with no beneficiaries are filtered out.
The dataset
u_benis created by taking the highest number of beneficiaries by age and sex group by location of all non-recurrent beneficiaries. The columns are then filtered down to create a more usable dataset. We will use this dataset for any operations related to unique beneficiaries (individuals), including comparisons with census data to determine reach and coverage as well as geographic analysis. We drop activity information from this dataset as it only contains the max of per disaggregation group per location anyway.
# pivoting longer and creating new dataframe
u_ben <- act2 %>%
select(-t_f_ben, -total_ben_check) %>% # remember to drop them
relocate(mes_solo) %>% # moves mes_solo to the first variable
pivot_longer(m_0_3:no_esp_ben,
names_to = "desagregacion", values_to = "beneficiarios") %>%
filter(recurrente_beneficiarios == FALSE) %>%
filter(beneficiarios != 0) %>% # empty cells are 0 in the 5W table
group_by(ubicacion, desagregacion) %>%
slice(which.max(beneficiarios)) %>%
ungroup() %>%
select(mes_solo, estado, pcode1, municipio, pcode2,parroquia, pcode3,
ubicacion, latitud, longitud,
desagregacion, beneficiarios)
# writing csv of u_ben with datestamp
# turn on eval if you want the file
filename_u_ben <- paste0(format(now, "%y%m%d_%H%M_"), "u_ben.csv")
write_csv(u_ben, filename_u_ben)
The dataset
act_benis for activity-wise analysis. Only the rows marked asrecurrente_beneficiarios == FALSEhave been selected. The first instance of recurring beneficiaires is marked FALSE, with all subsequent entries being marked TRUE. Similar tou_ben, it has also been pivoted longer into tidy data.
Please not that there is double counting in this dataset as specific beneficiaries might been reached by more than one activity. For geographic analysis, please use
u_ben.
# This is the dataset for beneficiaries
act_ben <- act2 %>%
select(-t_f_ben, -total_ben_check) %>% # remember to drop them
relocate(mes_solo) %>% # moves mes_solo to the first variable
filter(recurrente_beneficiarios == FALSE) %>%
pivot_longer(m_0_3:no_esp_ben,
names_to = "desagregacion", values_to = "beneficiarios") %>%
filter(beneficiarios != 0) %>% # empty cells are 0 in the 5W table
group_by(ubicacion, desagregacion) %>%
select(mes_solo, nombre_del_proyecto, organizacion_implementadora,
estado, pcode1, municipio, pcode2, parroquia, pcode3,
ubicacion, latitud, longitud,
actividad_codigo, actividad_desc, act_covid,
desagregacion, beneficiarios)
# writing csv act_ben with datestamp
# turn on eval if you want the file
filename_act_ben <- paste0(format(now, "%y%m%d_%H%M_"), "act_ben.csv")
write_csv(act_ben, filename_act_ben)
u_ben, act_ben and differencePlease note that this still has that radio messaging activity lumped in with the totals.
## [1] "3,660,609"
## [1] "5,673,040"
## [1] "2,012,431"
This is an entirely automated report – all charts and tables, as well as all figures within the report have been generated from the data, with no manual input. This report makes use of the outputs of the
5W_cleaningsection above. This set of 5W data pertains to the Education Cluster in Venezuela and has had partner information removed. Code will not be shown in this section – you can download the Rmd at the top-left corner of this doucument.
| actividad | total | percent_of_total | male | female | sex_ratio |
|---|---|---|---|---|---|
| DISTRIBUCION DE KITS DE MATERIALES ESCOLARES | 471,568 | 50.91 | 232,197 | 239,372 | 0.97 |
| ALIMENTACION ESCOLAR | 156,472 | 16.89 | 72,035 | 84,437 | 0.85 |
| EDUCACION A DISTANCIA | 132,258 | 14.28 | 64,240 | 68,018 | 0.94 |
| APOYO PSICOEDUCATIVO PARA NNA | 120,887 | 13.05 | 54,859 | 66,028 | 0.83 |
| FORMACION DOCENTE Y OTRO PERSONAL EDUCATIVO | 17,238 | 1.86 | 3,470 | 13,224 | 0.26 |
| ACTIVIDADES CON ADOLESCENTES Y JOVENES DE NIVELACION, HABILIDADES PARA LA VIDA Y CAPACITACION TECNICA | 14,016 | 1.51 | 6,336 | 7,680 | 0.82 |
| BECAS Y OTROS INCENTIVOS PARA DOCENTES Y PERSONAL | 5,744 | 0.62 | 1,572 | 4,172 | 0.38 |
| ACTIVIDADES RECREATIVAS | 4,317 | 0.47 | 1,998 | 2,319 | 0.86 |
| INICIATIVAS PARA REINSERCION EDUCATIVA DE NNA FUERA DE LA ESCUELA | 3,869 | 0.42 | 2,114 | 1,755 | 1.2 |
A total of 728,408 individuals have been reached to date. In terms of frequencies (inclusive of double counting), 926,369 have been reached.
Additionally, the 4,746,671 beneficiary frequencies reached by the activity PROMOCION MENSAJES CLAVES PARA LA COMUNIDAD ESCOLAR have been removed from the totals in this report as the activity consists of solely radio messaging.
figures are unique beneficiaries/individuals
With reference to the 2017 Matricula dataset, we can see that the Education programme has reached an overall average of 9% of schoolgoing children aged 3-17 nationwide. Children aged 3-17 consitute 88% of all UNICEF beneficiaries.
| Edad grupo | beneficiarios | matricula2017 | percent_total |
|---|---|---|---|
| 3-6 | 121,169 | 1,438,475 | 8.423 |
| 7-12 | 352,808 | 3,252,505 | 10.85 |
| 12-17 | 169,771 | 2,205,724 | 7.697 |
The number of individuals reached has increased by 175,005 in the past month, reaching a total of 728,408. The number of beneficiary frequencies reached has increased by 218,443 in the same period, reaching a total of 926,369.
mouse over to see details
Progress in recent months has largely been due to the distribution of education kits and distance learning.
A total of 2,229 schools have been reached by UNICEF; 44% are from Miranda and Zulia alone.
logarithmic scale; larger points indicate more beneficiaries reached, darker blues indicate more activity types
mouse over municipalities to see beneficiaries and distinct activities
A total of 110 municipalities were reached by the UNICEF Education programme.
| estado | municipio | beneficiarios |
|---|---|---|
| DISTRITO CAPITAL | LIBERTADOR | 80482 |
| MIRANDA | SUCRE | 59176 |
| ZULIA | MARACAIBO | 55370 |
| BOLIVAR | CARONI | 37908 |
| ZULIA | SAN FRANCISCO | 29369 |
| AMAZONAS | ATURES | 22430 |
| LARA | IRIBARREN | 21977 |
| BOLIVAR | HERES | 18293 |
| DELTA AMACURO | TUCUPITA | 17953 |
| BOLIVAR | CEDENO | 15681 |
| estado | municipio | coverage_percent |
|---|---|---|
| TACHIRA | FERNANDEZ FEO | 87 |
| TACHIRA | AYACUCHO | 79 |
| ZULIA | MACHIQUES DE PERIJA | 79 |
| AMAZONAS | AUTONOMO AUTANA | 76 |
| TACHIRA | SAMUEL DARIO MALDONADO | 75 |
| TACHIRA | PANAMERICANO | 72 |
| MIRANDA | PLAZA | 68 |
| TACHIRA | INDEPENDENCIA | 68 |
| TACHIRA | JUNIN | 65 |
| MIRANDA | EL HATILLO | 64 |
Together, the 10 municipalities with the highest reach (above left) form 49% of the 728,408 beneficiaries reached. The average coverage of the school-age population in the municipalities where UNICEF is present is 20%. Coverage refers to the percentage of enrolled children (aged 3-17 years) reached by UNICEF.
Below is a histogram of munciipalities where UNICEF is present showing the coverage of enrolled children (aged 3-17). Of note, we have reached 10% or less of the population in 55 out of the 110 in which we operate. This is in addition to the 226 where no UNICEF Education activities have occurred.
| partner | SOCIO 2 | SOCIO 1 | SOCIO 4 | SOCIO 5 | UNICEF | SOCIO 3 | SOCIO 6 | SOCIO 7 | SOCIO 10 | SOCIO 8 |
| act_types | 9 | 8 | 8 | 6 | 6 | 4 | 2 | 2 | 1 | 1 |
mouse over for details
| organizacion_implementadora | beneficiarios | percent_of_total | male | female | sex_ratio | municipalities |
|---|---|---|---|---|---|---|
| UNICEF | 353,944 | 38.21 | 168,281 | 185,664 | 0.91 | 74 |
| SOCIO 1 | 160,260 | 17.3 | 79,958 | 80,302 | 1 | 92 |
| SOCIO 2 | 145,997 | 15.76 | 65,390 | 80,607 | 0.81 | 51 |
| SOCIO 5 | 113,845 | 12.29 | 50,511 | 63,334 | 0.8 | 10 |
| SOCIO 4 | 86,074 | 9.29 | 42,534 | 43,540 | 0.98 | 17 |
| SOCIO 7 | 31,322 | 3.38 | 14,750 | 16,572 | 0.89 | 10 |
| SOCIO 8 | 29,471 | 3.18 | 15,379 | 14,092 | 1.09 | 17 |
| SOCIO 10 | 2,461 | 0.27 | 1,151 | 1,310 | 0.88 | 1 |
| SOCIO 3 | 2,423 | 0.26 | 862 | 1,561 | 0.55 | 3 |
| SOCIO 6 | 572 | 0.06 | 5 | 23 | 0.22 | 7 |
use UNICEF_present to filter to municipalities where the Education programme operates
CA01.05 Promocion de mensajes claves para la comunidad escolar is not included